4 research outputs found
Topic identification using filtering and rule generation algorithm for textual document
Information stored digitally in text documents is seldom arranged according to specific topics. Having to read whole documents is time-consuming and reduces interest in searching for information. Most existing topic identification methods depend on the occurrence of terms in the text. However, not all frequently occurring terms are relevant. In addition, the term extraction phase of a topic identification method can yield terms with similar meanings, which is known as the synonymy problem. Filtering and rule generation algorithms are introduced in this study to identify topics in textual documents. The proposed filtering algorithm (PFA) extracts the most relevant terms from the text and resolves the synonymy problem among the extracted terms. The rule generation algorithm (TopId) is proposed to identify the topic of each verse based on the extracted terms. The PFA processes and filters each sentence based on nouns and predefined keywords to produce suitable terms for the topic. Rules are then generated from the extracted terms using the rule-based classifier. An experiment was performed on 224 verses of the English-translated Quran that relate to female issues. Topics identified by TopId and by the Rough Set technique were compared and later verified by experts. PFA extracted more relevant terms than other filtering techniques. TopId identified topics that are closer to the topics from experts, with an accuracy of 70%. The proposed algorithms were able to extract relevant terms without losing important terms and to identify the topics of the verses.
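The two stages described in this abstract (filtering each sentence by nouns and predefined keywords, then applying rules to the extracted terms) can be illustrated with a minimal sketch. It is not the authors' PFA or TopId: the `NOUNS` and `KEYWORDS` sets, the `RULES` list, and the function names are hypothetical stand-ins, and a plain set lookup replaces real part-of-speech tagging.

```python
import re

# Hypothetical lexicons: the abstract does not list the actual nouns or keywords.
NOUNS = {"mother", "wife", "inheritance", "marriage", "share"}
KEYWORDS = {"divorce"}  # predefined terms kept even when not recognised as nouns

# Hypothetical rules: (terms that must all be present, topic label).
RULES = [
    ({"inheritance", "share"}, "inheritance"),
    ({"marriage"}, "marriage"),
    ({"divorce"}, "divorce"),
]


def filter_terms(sentence):
    """Keep tokens that are known nouns or predefined keywords.

    Order of first appearance is preserved and duplicates are dropped.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    kept = []
    for token in tokens:
        if (token in NOUNS or token in KEYWORDS) and token not in kept:
            kept.append(token)
    return kept


def identify_topic(terms):
    """Return the label of the first rule whose terms all appear, else None."""
    present = set(terms)
    for required, topic in RULES:
        if required <= present:
            return topic
    return None


terms = filter_terms("The mother receives a share of the inheritance.")
topic = identify_topic(terms)
```

Here `terms` holds the filtered terms in sentence order and `topic` holds the label chosen by the first matching rule. In practice the noun test would come from a part-of-speech tagger, and the rules would be generated from data instead of being written by hand.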
Topic identification method for textual document
Abstract: Topic identification is a crucial task for discovering knowledge from textual documents. Existing methods for topic identification suffer from a word-counting problem because they depend on the most frequent terms in the text to produce the topic keyword. Not all frequent terms are relevant. This paper proposes a topic identification method that filters the important terms from the preprocessed text and applies a term weighting scheme to solve the synonym problem. A rule generation algorithm is used to determine the appropriate topics based on the weighted terms. The text document used in the experiment is the English-translated Quran. The topics identified by the proposed method were compared with topics identified using Rough Set and by domain experts. The proposed topic identification method was consistently able to identify topics that are mostly close to those given by Rough Set and the experts. The comparison shows that the proposed method can be used to capture topics in textual documents.
Rule-based filtering algorithm for textual document
Textual documents are usually unstructured and high-dimensional. Exploring the hidden information in unstructured text is useful for finding interesting patterns and valuable knowledge. However, not all terms in the text are relevant, and irrelevant terms can lead to misclassification. Improper filtration might cause terms that have similar meanings to be removed. Thus, to reduce the high dimensionality of text, this study proposes a filtering algorithm that filters the important terms from the pre-processed text and applies a term weighting scheme to solve the synonym problem, which helps the selection of relevant terms. The proposed filtering algorithm uses a keyword library of special terms, developed to ensure that important terms are not eliminated during the filtration process. The performance of the proposed filtering algorithm is compared with rough set attribute reduction (RSAR) and information retrieval (IR) approaches. In the experiment, the proposed filtering algorithm outperformed both RSAR and IR in terms of the relevant terms extracted.
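One way to picture how a term weighting scheme and a keyword library might interact is sketched below. It assumes, purely for illustration, that synonyms are mapped to one canonical term before counting and that a term survives filtering if its weight reaches a threshold or it appears in the keyword library. The `SYNONYMS` map, the `KEYWORD_LIBRARY` set, the `min_weight` parameter, and the function name are all hypothetical; the abstract does not specify the actual weighting scheme.

```python
from collections import Counter

# Hypothetical synonym map: each variant points to a canonical term.
SYNONYMS = {"spouse": "wife", "wives": "wife"}

# Hypothetical keyword library: special terms that must never be filtered out.
KEYWORD_LIBRARY = {"dowry"}


def weight_terms(tokens, min_weight=2):
    """Merge synonyms, count occurrences, and drop low-weight terms.

    A term is kept when its merged count reaches min_weight or when it
    belongs to the keyword library.
    """
    canonical = [SYNONYMS.get(token, token) for token in tokens]
    weights = Counter(canonical)
    return {
        term: weight
        for term, weight in weights.items()
        if weight >= min_weight or term in KEYWORD_LIBRARY
    }


weights = weight_terms(["wife", "spouse", "wives", "dowry", "house"])
```

In this example the three synonymous tokens pool their counts under `wife`, `dowry` is retained despite occurring once because it is in the keyword library, and `house` is dropped.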
A subject identification method based on term frequency technique
Analyzing and extracting important information from a text document is crucial and has generated interest in the areas of text mining and information retrieval. This process is used to notice particular information in the text. Readers tend to read almost everything in a text document to find some specific information. However, reading a text document takes time, and extracting information from it takes additional time. Thus, classifying text by subject can guide a person to relevant information. In this paper, a subject identification method based on term frequency is proposed to categorize groups of text into a particular subject. Since term frequency tends to ignore the semantics of a document, a term extraction algorithm is introduced to improve the relevant terms extracted from the text. The evaluation of the extracted terms shows that the proposed method outperformed other extraction techniques.
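As a rough illustration of subject identification by term frequency, the sketch below counts the terms in a text and assigns the subject whose vocabulary accounts for the highest total count. The `SUBJECT_TERMS` vocabularies and the function names are hypothetical, and the sketch covers only the plain term-frequency baseline, not the term extraction algorithm the paper adds on top of it.

```python
import re
from collections import Counter

# Hypothetical subject vocabularies, invented for this example.
SUBJECT_TERMS = {
    "finance": {"bank", "loan", "interest"},
    "health": {"doctor", "clinic", "medicine"},
}


def term_frequencies(text):
    """Count how often each lowercase alphabetic token occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def identify_subject(text):
    """Return the subject with the highest summed term frequency, else None."""
    frequencies = term_frequencies(text)
    scores = {
        subject: sum(frequencies[term] for term in terms)
        for subject, terms in SUBJECT_TERMS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


subject = identify_subject("The bank approved the loan; the bank charges interest.")
```

Because the score is a raw count, a text that uses different words for the same concept would score zero, which is the semantic blind spot of term frequency that the abstract mentions.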